Whole-genome variants dataset of 209 local chickens from China

Compared to commercial chickens, local breeds exhibit better meat quality and flavour, but their productivity (e.g., growth rate, body weight) is rather low. Genetic analysis based on whole-genome sequencing helps to identify genetic markers and putative candidate genes related to economic traits, facilitating the improvement of production performance, the acceleration of breeding progress, and the conservation of genetic resources. Here, a total of 209 local chickens from 13 breeds were sequenced. Approximately 91.4% of the sequence data were of high quality (Q30 > 90%), the mapping rate exceeded 99% for each individual, and the genome coverage reached 97.6%, indicating that the data are of good quality. Over 19 million single nucleotide polymorphisms (SNPs) and 1.98 million insertion-deletions (InDels) were identified against the reference genome (GRCg7b), adding to the public variant databases. This dataset provides a valuable resource for studying genetic diversity and adaptation and for the cultivation of new chicken breeds/lines.


Background & Summary
Chickens are among the most important farm animals, supplying eggs, meat, and other products to humans. After long-term domestication and human-driven selection, hundreds of distinctive chicken breeds are now raised worldwide, and chicken has been the largest source of meat since 2019 [1]. In particular, more than one hundred local breeds have been identified in China, constituting almost half of the broiler market. Compared to commercial broilers (e.g., Arbor Acre, Ross, and Cobb), local chickens exhibit markedly better meat quality and flavour, adaptation to the environment, and disease resistance [2,3]. However, local breeds have not undergone intense artificial selection for productivity traits, which may explain their lower production performance compared to commercial breeds [4,5]. Nevertheless, compared to pigs or cattle, chickens are more efficient and environmentally friendly livestock. Their feed conversion efficiency is ten times that of cattle, and the carbon emissions of broilers are only 1/10 of those of cattle [6]. Therefore, it is necessary to investigate genomic markers and genetic mechanisms related to economic traits via whole-genome sequencing to bridge the gap in production performance and accelerate breeding progress [7].
At the same time, domestic chickens are desirable models for investigating genetic adaptation, genetic diversity, and disease-related markers because of their short reproductive cycle, small body size, and common ancestry. After domestication from red jungle fowl (Gallus gallus spadiceus) [8], chickens were bred to meet various human demands, and meat-type (e.g., Xiaoshan chicken), egg-type (e.g., Xianju chicken), ornamental-type (e.g., Silkie chicken), and game fowl breeds (e.g., Luxi gamecocks, Henan gamecocks) were developed [9-12], as well as commercial broilers (e.g., broiler line B, Cobb) [13]. Therefore, it is relatively easy to infer the driving factors of phenotypic or behavioural changes in these chickens unambiguously. Additionally, domestic chickens can be used to generate special populations, such as F2 populations [14,15] and advanced intercross lines [16], which help to explore the genetic mechanisms of complex traits in chickens and provide new insights into genomic breeding.
Genomic analysis can reveal the demographic history of different chicken breeds and reconstruct gene flow among them, which contributes to a better understanding of domestication history and of the potential mechanisms underlying some economic traits, such as breast muscle yield [17] and yellow skin [4,18]. The analysis of genome-wide variants in chickens from different regions can be used to investigate genetic adaptation and diversity, especially adaptation to environmental conditions such as altitude, temperature, and anoxic environments [19-22]. Moreover, selective sweep analysis based on single nucleotide polymorphisms (SNPs) is an effective method for identifying genetic markers and mechanisms underlying chicken production performance, reproduction, immunity, etc. This approach provides important insights into modern breeding systems [20-25]. In addition to SNPs and insertion-deletions (InDels), genome-wide sequencing has also been used to identify copy number variants (CNVs) [26] and structural variants (SVs) [27], although the detection rate and efficiency of SV calling from short reads are relatively low compared to those of PacBio sequencing [28]. Therefore, genomic analysis using the whole-genome variants of local chickens is an effective approach for elucidating genetic diversity and selective signatures during long-term domestication.
This study provides a whole-genome sequencing dataset from 209 local chickens of 13 local chicken breeds in China, including meat-type, egg-type, and ornamental-type chickens. A total of more

Fig. 1 The workflow of library preparation, sequencing, genome mapping, and variant calling and filtration. The pipeline was consistent with the GATK recommended protocol for variant identification.

Table 2. Summary statistics for genome alignment analysis¹.
¹ All terms were calculated for each chicken breed; values are means over the individuals of each breed.
² The coverage (1×) was calculated as the ratio of the total length of the genome covered by at least 1 read to the genome length.
³ The coverage (4×) was calculated as the ratio of the total length of the genome covered by at least 4 reads to the genome length.
⁴ Indicates that all terms were calculated across all individuals and are presented as averages.

to the chicken variant database and plays a crucial role in reconstructing the demographic and domestication history of chickens.

Methods
Sampling. Blood sampling from 13 chicken breeds was performed in Zhejiang Province, China. The following breeds were included (Supplement Table S1
Genomic DNA extraction and quality control. The workflow from sampling to variant filtration is shown in Fig. 1. Genomic DNA was extracted from blood samples using the phenol-chloroform method. DNA quality control was performed as follows: 1) DNA degradation and contamination were monitored on 1% agarose gels; 2) the OD 260/280 ratio was determined with a NanoDrop instrument to check the purity of the DNA; and 3) the DNA concentration was measured with a Qubit® DNA Assay Kit on a Qubit® 3.0 Fluorometer (Invitrogen, USA). Finally, more than 0.2 μg of DNA with no degradation or contamination and an OD 260/280 ratio of 1.8 to 2.0 was used for library construction.
Library preparation and sequencing. A sequencing library was prepared with the NEBNext® Ultra™ DNA Library Prep Kit for Illumina (NEB, USA). The genomic DNA was sheared to a mean fragment size of 350 bp. The DNA fragments were then subjected to end polishing, A-tailing, and adapter ligation. PCR amplification and purification were then carried out using an AMPure XP system (Beckman Coulter, Beverly, USA). The quality of the library was evaluated based on the insert size and DNA concentration using an Agilent 2100 Bioanalyzer and a Qubit 3.0 Fluorometer (Invitrogen, USA), and the library was quantified by real-time PCR (>2 nM). Finally, the qualified DNA nanoballs were loaded onto a flow cell and sequenced on a DNBSEQ-T7 platform.

Variant density, genetic diversity, and polymorphism information content (PIC) estimation.
The variant density was calculated as the ratio of the total length of the genome (with N bases removed) to the variant number. The SNP/InDel density was calculated and visualized with the CMplot package [36]. The genetic diversity (π) was calculated using VCFtools v0.1.13 [33] with a window size of 50 kb. The PIC was estimated using the following formula:
$$\mathrm{PIC}_i = 1 - \left(p_i^2 + q_i^2\right) - 2\,p_i^2 q_i^2,$$
where $p_i$ indicates the frequency of the minor allele of SNP $i$ and $q_i$ indicates the frequency of the major allele of SNP $i$.
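As a minimal sketch of the two per-SNP and per-genome quantities described here, the Python below computes PIC for a biallelic SNP and the variant density in bases per variant. It assumes the standard biallelic PIC definition, 1 - (p² + q²) - 2p²q² with q = 1 - p; the function names, and the choice to average PIC over the SNPs of a breed, are illustrative assumptions rather than the authors' code.

```python
def pic_biallelic(p):
    """PIC of one biallelic SNP from its minor allele frequency p (major q = 1 - p).

    Assumes the standard biallelic definition: 1 - (p^2 + q^2) - 2 p^2 q^2.
    """
    if not 0.0 <= p <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    q = 1.0 - p
    return 1.0 - (p * p + q * q) - 2.0 * p * p * q * q


def mean_pic(minor_allele_freqs):
    """Average per-SNP PIC over a set of SNPs (averaging is an assumption)."""
    freqs = list(minor_allele_freqs)
    if not freqs:
        raise ValueError("need at least one SNP")
    return sum(pic_biallelic(p) for p in freqs) / len(freqs)


def variant_density(genome_length_no_n, variant_count):
    """Bases per variant: non-N genome length divided by the number of variants."""
    if variant_count <= 0:
        raise ValueError("variant_count must be positive")
    return genome_length_no_n / variant_count
```

Under this definition a SNP with equal allele frequencies (p = 0.5) reaches the biallelic maximum PIC of 0.375, and a monomorphic site has a PIC of 0.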

Data Records
Raw FASTQ files for whole-genome sequencing were deposited in the NCBI Sequence Read Archive (SRA) and have been assigned BioProject accession number PRJNA942350 (https://identifiers.org/ncbi/insdc.sra:SRP426730) [37]. The raw VCF file containing SNPs and InDels was deposited in the EVA database with accession number PRJEB71347 (https://identifiers.org/ena.embl:PRJEB71347) [38]. The quality control results of raw reads and annotation files of variants have been deposited in the Figshare database with the following digital object identifier: https://doi.org/10.6084/m9.figshare.24751956.v2 [39]. The relationship between the chicken IDs in the VCF files and the SRA database is shown in Supplement Table S2.

Technical Validation
Quality control of sequencing results. For each individual, 17.3 to 28.5 Gb of raw data were obtained, with 91.4% of the data on average achieving a Phred quality score of at least 30 (indicating a sequencing accuracy of 99.9%) (Table 1). The GC content of the sequences was stable (42.6%) (Table 1, Supplement Table S3). A genome mapping rate greater than 99% and an average genome coverage of 97.1% (with N bases removed) were obtained (Table 2, Supplement Table S4).
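The link between a Phred score of 30 and 99.9% accuracy follows from the definition Q = -10 log10(P), where P is the base-call error probability. The sketch below illustrates that conversion and a Q30 fraction for a FASTQ quality string. It assumes Phred+33 ASCII encoding, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def fraction_at_least(scores, threshold=30):
    """Fraction of bases whose Phred score is at least `threshold` (Q30 by default)."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one base quality")
    return sum(s >= threshold for s in scores) / len(scores)
```

For example, `phred_to_error_prob(30)` gives an error probability of 0.001, that is, one expected miscall per 1,000 bases.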
Filtration of SNP and InDel datasets. A joint calling strategy was used, and a total of 27 million raw SNPs and 2.75 million raw InDels were identified in the population of 13 chicken breeds. To exclude low-quality variants, we used the VariantFiltration function in GATK software [32]. The specific standards used are described in the Methods section above. After the first round of quality control, the SNPs and InDels were further filtered using VCFtools v0.1.13 software [33], and only biallelic variants with a minor allele frequency of at least 0.01 and a minimum sequencing depth of 3 were retained. We calculated statistics for SNP types, and T:A > C:G-type mutations were the most common in this population (Fig. 2). Figure 3 shows the relationships among SNP quality, number of supporting reads, SNP percentage, and neighbouring SNP distance. We detected a positive correlation between SNP quality and percentage, and most SNPs were supported by at least 20 reads (Fig. 3a,b). This indicates the high quality of the identified SNPs.
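The second-round filter and the mutation-type tally can be sketched in plain Python as follows. This is an illustration only, not the VCFtools implementation: the helper names are hypothetical, the depth cut-off is assumed to be a per-site mean, and the six strand-collapsed class labels (for example T:A>C:G) are assumed to match the convention of Fig. 2.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def minor_allele_frequency(genotypes):
    """MAF of a biallelic site from GT strings such as '0/1' or '1|1'.

    Missing alleles ('.') are skipped; returns 0.0 if nothing was called.
    """
    alt = total = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            if allele == "1":
                alt += 1
    if total == 0:
        return 0.0
    freq = alt / total
    return min(freq, 1.0 - freq)


def keep_site(alt_field, genotypes, depths, min_maf=0.01, min_mean_depth=3.0):
    """Keep biallelic sites with MAF >= min_maf and mean depth >= min_mean_depth.

    Treating the depth threshold as a per-site mean is an assumption.
    """
    if "," in alt_field:  # several ALT alleles: not biallelic
        return False
    if not depths or sum(depths) / len(depths) < min_mean_depth:
        return False
    return minor_allele_frequency(genotypes) >= min_maf


def mutation_class(ref, alt):
    """Collapse a SNP onto six strand-symmetric classes, e.g. A>G -> 'T:A>C:G'."""
    if ref in "AG":  # report the change relative to the pyrimidine strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"
```

Collapsing complementary changes means that A>G and T>C are counted together as T:A>C:G, which is why six rather than twelve SNP classes are reported.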

Summary statistics of SNPs and InDels across the whole genome. The high-quality SNPs and InDels were distributed across the genome with an average density of 1 SNP every 52 bases and 1 InDel every 521 bases (Table 3). Figures 3c and 4a show the high density of SNPs across the whole genome. The density of InDels was relatively low (Fig. 4b). Chromosomes 16, 29-32, and 34-38 were defined as dot chromosomes, while chromosomes 22, 25 and 33 were defined as microchromosomes [40]. We found that the SNP density on these chromosomes exhibited a polarized distribution (Table 3). Based on variant annotation, SNPs and InDels were mainly distributed in intronic and intergenic regions (Fig. 5), and more SNPs than InDels were located in exons (Fig. 5a). The PIC and genetic diversity (π) of each breed are shown in Fig. 6. The genomes of the SF and YD chickens exhibited higher polymorphism, whereas LH chickens showed significantly lower genetic diversity.
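A genome-wide density plot of the kind shown in Fig. 4 rests on counting variants in fixed windows along each chromosome. The sketch below shows that counting step under the assumption of non-overlapping 1 Mb windows and 1-based positions; it is not the CMplot implementation, and the function name is hypothetical.

```python
from collections import Counter


def snps_per_window(positions, window_size=1_000_000):
    """Count variants per (chromosome, window index) for 1-based positions.

    Window 0 covers positions 1..window_size, window 1 the next block, and so on.
    """
    counts = Counter()
    for chrom, pos in positions:
        counts[(chrom, (pos - 1) // window_size)] += 1
    return dict(counts)
```

Dividing the window size by each count gives a local bases-per-variant figure that can be compared with the genome-wide averages reported in Table 3.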

Usage Notes
Whole-genome sequencing allows SNPs and InDels to be obtained across the whole genome using the bioinformatic pipeline of this study. However, CNVs and SVs were not included in the current study. As in previous reports in humans and chickens [8,41-43], we did not account for the ploidy of variants on the sex chromosomes (Z and W), and the variants on chromosome W were discarded because of unresolved gaps in that chromosome. Although SNPs are widely used for investigating genetic diversity and dissecting the genetic mechanisms of economic traits, the heritability of traits of interest cannot be fully explained by SNPs, a problem known as missing heritability. These traits are also influenced by epigenetic factors. Missing heritability could be reduced by increasing the sample size and sequencing depth and by considering epistatic effects, SVs, and CNVs, which also contributes to the identification of additional novel loci [44]. Additionally, we aligned the sequencing data to the reference genome (GRCg7b, GCF_016699485.2), which was assembled from a single broiler and therefore does not represent all variants. As a result, sequences absent from the reference could not be aligned, and the data produced in this study may give an incomplete picture of the genetic background. Overall, our data provide insights into chicken population structure and can also be used to identify CNVs and SVs. These data can be used in the construction of chicken pangenomes or graph pangenomes together with PacBio and Oxford Nanopore Technologies data.

Fig. 2 Statistics for the SNP number of different mutation types.

Fig. 3 Percentage of SNPs by number of supporting reads (a), quality (b), and neighbouring SNP distance (c). Different colours indicate different individuals.

Fig. 4 Distribution of SNPs and InDels across the whole genome of 13 local chicken breeds. (a) SNP density across the whole genome. (b) InDel density across the whole genome.

Fig. 5

Fig. 6 Estimation of genomic PIC (a) and π (b) based on SNPs of 13 chicken breeds.

Table 1. Summary statistics for quality control of sequences¹.
¹ All terms were calculated for each chicken breed; values are means over the individuals of each breed.
² Indicates that all terms were calculated across all individuals and are presented as averages.
putative regions of positive selection, inferring demographic history, analysing gene flow, detecting candidate genes related to economic traits, determining genetic adaptation to local environmental factors, discovering breed-specific variants or markers, analysing genetic diversity, or developing SNP genotyping arrays for use in chicken breeding systems or species identification. Moreover, the whole-genome variants of most chicken breeds (except for Beijing You and Silky-feather chickens) included in the present study have not been reported. Therefore, this dataset provides an ideal resource for population genetics and evolutionary analyses. Furthermore, this database represents an important supplement

Table 3. Summary statistics of SNPs and InDels in each chromosome¹.
¹ The variant density was calculated as the ratio of chromosome length to variant number.